-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ballista shuffle is finally working as intended, providing scalable distributed joins #750
Conversation
@houqp @Dandandan @edrevo @alamb @jorgecarleitao Ballista is finally working with scalable distributed joins, at least it is for TPC-H. I plan on following up with some further smaller code cleanup PRs now that the functionality is working. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the code -- while I am not a ballista expert it seems reasonable to me.
One thing I did notice was that there don't appear to be any new / updated tests in this PR.
@@ -69,7 +78,8 @@ impl ExecutionPlan for UnresolvedShuffleExec { | |||
} | |||
|
|||
fn output_partitioning(&self) -> Partitioning { | |||
Partitioning::UnknownPartitioning(self.partition_count) | |||
//TODO the output partition is known and should be populated here! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this something that you want to finish up in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've filed https://github.com/apache/arrow-datafusion/issues/758 as a follow-up for implementing this since it involves more serde work.
I've added an additional test to check that TPC-H query 12 gets planned with correct partitioning information in the shuffle readers. |
🎉 |
Which issue does this PR close?
Builds on #738 Closes #707.
With this PR we finally have scalable distributed joins.
Query 12 performance at SF=100
Integration tests pass.
Rationale for this change
This is making Ballista work as it was intended to work.
What changes are included in this PR?
Tons of bug fixes around shuffles.
Are there any user-facing changes?
No